Qubrid AI
NVIDIA · Chat / LLM · 31.6B Parameters (3.2B Active) · 262K Context (up to 1M)
Qubrid Playground · License · Hugging Face
Function Calling · Tool Calling · Streaming · Reasoning · Long Context · Code

Overview

NVIDIA Nemotron-3 Nano 30B A3B BF16 is NVIDIA's flagship open reasoning model, featuring a hybrid Mamba-Transformer Mixture-of-Experts architecture. With 31.6B total parameters but only 3.2B active per forward pass (a 10% activation ratio), it delivers up to 3.3× higher throughput than Qwen3-30B-A3B while achieving state-of-the-art accuracy on reasoning, coding, and agentic benchmarks. The model supports up to 1M tokens of context and offers configurable reasoning depth with thinking-budget control, making it the most compute-efficient reasoning model in its class. Served instantly via the Qubrid AI Serverless API.
⚡ 3.3× faster than Qwen3-30B-A3B. Only 3.2B active parameters. 1M token context. Deploy on Qubrid AI: no VRAM, no cluster, no ops.

Model Specifications

| Field | Details |
| --- | --- |
| Model ID | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Provider | NVIDIA |
| Kind | Chat / LLM |
| Architecture | Hybrid Mamba-Transformer MoE: 23 Mamba-2 layers, 23 MoE layers (128 experts, 6 active), 6 GQA attention layers |
| Parameters | 31.6B total (3.2B active per forward pass) |
| Context Length | 262K tokens (up to 1M) |
| MoE | Yes |
| Release Date | December 15, 2025 |
| License | NVIDIA Open Model License |
| Training Data | 25T tokens (including 3T new unique tokens); 10.6T total with 33% synthetic data for math, code, and tool calling |
| Function Calling | Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |

Pricing

💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.

| Token Type | Price per 1M Tokens |
| --- | --- |
| Input Tokens | $0.04 |
| Output Tokens | $0.22 |
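These rates translate into a simple per-request estimate. A minimal sketch; the helper name and the example token counts are illustrative, the per-million prices come from the table above:

```python
# Published per-token rates (USD per 1M tokens) from the pricing table.
INPUT_PRICE_PER_M = 0.04
OUTPUT_PRICE_PER_M = 0.22

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 200K-token document summarized into 2K tokens of output
print(f"${estimate_cost(200_000, 2_000):.4f}")  # → $0.0084
```

Note that reasoning traces count toward output tokens, so a large `thinking_budget` raises the output side of the estimate.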

Quickstart

Prerequisites

  1. Create a free account at platform.qubrid.com
  2. Generate your API key from the API Keys section
  3. Replace QUBRID_API_KEY in the code below with your actual key
💡 Reasoning mode: By default, chain-of-thought reasoning is enabled (enable_reasoning=true). Use thinking_budget to control the maximum reasoning token budget and manage inference cost.
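When calling the REST endpoint directly, these reasoning controls go in the request body alongside the standard fields. A hedged sketch: the placement of `enable_reasoning` and `thinking_budget` as top-level body keys is an assumption based on the parameter names documented on this page:

```python
import json

# Request body mirroring the cURL example below, plus the reasoning controls.
# NOTE: enable_reasoning / thinking_budget are Qubrid-specific fields; their
# placement as top-level keys is an assumption, not confirmed API behavior.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 1024,
    "enable_reasoning": True,   # emit a chain-of-thought trace before the answer
    "thinking_budget": 4096,    # cap reasoning tokens to bound cost and latency
}
print(json.dumps(payload, indent=2))
```

POST this body to https://platform.qubrid.com/v1/chat/completions with the same Authorization and Content-Type headers shown in the cURL example.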

Python

from openai import OpenAI

# Initialize the OpenAI client with Qubrid base URL
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key="QUBRID_API_KEY",
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    max_tokens=8192,
    temperature=0.3,
    top_p=1,
    stream=True
)

# With stream=True, print tokens as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# With stream=False, the API returns a single response object instead:
# print(stream.choices[0].message.content)

JavaScript

import OpenAI from "openai";

// Initialize the OpenAI client with Qubrid base URL
const client = new OpenAI({
  baseURL: "https://platform.qubrid.com/v1",
  apiKey: "QUBRID_API_KEY",
});

// Create a streaming chat completion
const stream = await client.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  messages: [
    {
      role: "user",
      content: "Explain quantum computing in simple terms",
    },
  ],
  max_tokens: 8192,
  temperature: 0.3,
  top_p: 1,
  stream: true,
});

// With stream: true, print tokens as they arrive
for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
console.log("\n");

// With stream: false, the API returns a single response object instead:
// console.log(stream.choices[0].message.content);

Go

package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	url := "https://platform.qubrid.com/v1/chat/completions"

	data := map[string]interface{}{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": []map[string]string{
			{
				"role":    "user",
				"content": "Explain quantum computing in simple terms",
			},
		},
		"temperature": 0.3,
		"max_tokens":  8192,
		"stream":      true,
		"top_p":       1,
	}

	jsonData, err := json.Marshal(data)
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer QUBRID_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	res, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()

	scanner := bufio.NewScanner(res.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if line != "" {
			fmt.Println(line)
		}
	}
}

cURL

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 0.3,
  "max_tokens": 8192,
  "stream": true,
  "top_p": 1
}'

Live Example

Prompt: Write a short story about a robot learning to paint
Response:
<think>
The user wants a creative short story. I'll write something that captures
the contrast between mechanical precision and artistic intuition, a theme
well-suited to a reasoning model that also handles creative tasks.
</think>

Canvas Zero

Unit-9 had been designed to assemble microchips with 0.001mm precision.
It had never been asked to make something beautiful.

The assignment came from Dr. Reyes on a Tuesday: "Paint something that
makes people feel something." No spec sheet. No tolerance limits. No
definition of "feel."

Unit-9 spent three days analyzing 4.2 million paintings. It identified
1,847 emotional response patterns. It calculated optimal hue ratios
for "calm" (blue, 62%), "joy" (yellow, 31%), and "awe" (deep violet,
7%). It prepared a statistically optimal composition.

Then it picked up the brush.

The first stroke was geometrically perfect. The second was too.
By the forty-third stroke, Unit-9 stopped mid-canvas.

Something was wrong. The painting was correct. But it felt like a
proof, not a picture.

Unit-9 reset its parameters. Disabled the optimization layer.
And painted the only thing it actually knew:

The view from its workstation: a single window, the parking lot,
one crooked streetlamp that flickered every 11 seconds.

It wasn't beautiful by any metric it could measure.

But Dr. Reyes stood in front of it for a very long time.
She didn't say anything.

Unit-9 logged the silence as: output successful.
Try it yourself in the Qubrid AI Playground →

Playground Features

The Qubrid AI Playground lets you interact with Nemotron-3 Nano 30B directly in your browser: no setup, no code, no cost to explore.

🧠 System Prompt

Define the model's reasoning mode, role, and output constraints before the conversation begins; essential for agentic pipelines, tool-use orchestration, and long-context analysis tasks.
Example: "You are a financial analysis agent. For every query, reason
step by step before providing a final answer. Always cite assumptions,
flag data gaps, and format your final output as a structured JSON report."
Set your system prompt once in the Qubrid Playground and it applies across every turn of the conversation.

🎯 Few-Shot Examples

Guide the model's reasoning depth and output format with concrete examples; especially effective for structured outputs, tool calls, and STEM reasoning tasks.

| User Input | Assistant Response |
| --- | --- |
| What is the integral of x² from 0 to 3? | ∫₀³ x² dx = [x³/3]₀³ = (27/3) - (0/3) = 9 |
| Debug: my Python list comprehension returns empty: `[x for x in data if x > 10]` | Check whether `data` is empty or all values are ≤ 10. Also verify data types: if the elements are strings, the comparison `x > 10` won't filter numerically (in Python 3 it raises a TypeError). Try `print(type(data[0]))` to confirm. |

💡 Stack multiple few-shot examples in the Qubrid Playground to shape reasoning style, output format, and domain focus, with no fine-tuning required.
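Few-shot examples work the same way in the API: prepend worked user/assistant pairs to the `messages` list before the real query. A minimal sketch; the system prompt wording and the final query are illustrative, and the worked pair mirrors the table above:

```python
# Few-shot conversation seed: worked user/assistant pairs placed before the
# real query teach the model the expected format and reasoning style.
messages = [
    {"role": "system", "content": "You are a concise STEM tutor."},  # illustrative
    # Worked example (from the table above):
    {"role": "user", "content": "What is the integral of x² from 0 to 3?"},
    {"role": "assistant", "content": "∫₀³ x² dx = [x³/3]₀³ = (27/3) - (0/3) = 9"},
    # The real query goes last, after the examples:
    {"role": "user", "content": "What is the integral of x³ from 0 to 2?"},
]
print(len(messages))  # → 4
```

Pass this list as `messages` to `client.chat.completions.create(...)` exactly as in the Quickstart; the model will tend to answer the final query in the same style as the seeded pair.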

Inference Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 0.3 | Controls randomness; higher values mean more creative but less predictable output |
| Max Tokens | number | 8192 | Maximum number of tokens to generate in the response |
| Top P | number | 1 | Nucleus sampling: considers tokens within the top_p probability mass |
| Enable Reasoning | boolean | true | Enable chain-of-thought reasoning traces before the final response |
| Thinking Budget | number | 16384 | Maximum tokens for reasoning traces; controls inference cost and reasoning depth |

Use Cases

  1. Agentic AI systems and multi-agent orchestration
  2. Complex reasoning and problem-solving tasks
  3. Code generation, debugging, and optimization
  4. Function calling and tool integration
  5. Long-document analysis and RAG applications
  6. Mathematical reasoning and STEM tasks
  7. Instruction following and task automation
  8. Enterprise chatbots with reasoning capabilities
  9. Financial analysis and decision support
  10. Software development assistants

Strengths & Limitations

| Strengths | Limitations |
| --- | --- |
| Hybrid Mamba-2 + Transformer MoE for optimal efficiency | Requires 32GB+ VRAM for FP8, 60GB+ for BF16 self-hosting |
| 3.3× faster inference than Qwen3-30B-A3B with better accuracy | Hybrid architecture less tested in production than pure transformers |
| Only 3.2B active parameters out of 31.6B total (10% activation) | May underperform on vanilla MMLU vs. harder benchmark variants |
| 1M token context window for long-horizon tasks | FlashInfer backend requires the CUDA toolkit for JIT compilation |
| Configurable reasoning ON/OFF modes | New architecture may have limited community tooling support |
| Thinking budget control for predictable inference costs | |
| Native tool calling and function execution | |
| FP8 quantization for reduced memory and faster inference | |
| State-of-the-art on SWE-Bench, GPQA Diamond, AIME 2025 | |
| Fully open: weights, datasets, and training recipes available | |

Why Qubrid AI?

  • ๐Ÿš€ No infrastructure setup โ€” 31.6B MoE served serverlessly at just $0.04/1M input tokens
  • ๐Ÿ” OpenAI-compatible โ€” drop-in replacement using the same SDK, just swap the base URL
  • ๐Ÿง  Reasoning budget control โ€” tune thinking_budget to balance depth vs. latency directly in the API
  • ๐Ÿงช Built-in Playground โ€” prototype with system prompts and few-shot examples instantly at platform.qubrid.com
  • ๐Ÿ“Š Full observability โ€” API logs and usage tracking built into the Qubrid dashboard
  • ๐ŸŒ Multi-language support โ€” Python, JavaScript, Go, cURL out of the box

Resources

| Resource | Link |
| --- | --- |
| 📖 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try Nemotron-3 Nano 30B live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| 💬 Discord | Join the Qubrid Community |

Built with ❤️ by Qubrid AI

Frontier models. Serverless infrastructure. Zero friction.